Abstract

Detecting outliers which are grossly different from or inconsistent with the 
remaining dataset is a major challenge in real-world KDD applications. Exist- 
ing outlier detection methods are ineffective on scattered real-world datasets 
due to implicit data patterns and parameter setting issues. We define a novel 
Local Distance-based Outlier Factor (LDOF) to measure the outlier-ness of 
objects in scattered datasets which addresses these issues. LDOF uses the 
relative location of an object to its neighbours to determine the degree to 
which the object deviates from its neighbourhood. Properties of LDOF are 
theoretically analysed including LDOF's lower bound and its false-detection 
probability, as well as parameter settings. In order to facilitate parameter 
settings in real-world applications, we employ a top-n technique in our outlier 
detection approach, where only the objects with the highest LDOF values are 
regarded as outliers. Compared to conventional approaches (such as top-n 
KNN and top-n LOF), our method top-n LDOF is more effective at detecting 
outliers in scattered data. It is also easier to set parameters, since its perfor- 
mance is relatively stable over a large range of parameter values, as illustrated 
by experimental results on both real-world and synthetic datasets. 
1 Introduction



Of all the data mining techniques that are in vogue, outlier detection comes closest to
the metaphor of mining for nuggets of information in real-world data. It is concerned
with discovering the exceptional behavior of certain objects [TCFC02]. Outlier
detection techniques have been widely applied in medicine (e.g. adverse reaction
analysis), finance (e.g. financial fraud detection), security (e.g. counter-terrorism),
information security (e.g. intrusion detection) and so on. In recent decades, many
outlier detection approaches have been proposed, which can be broadly classified into
several categories: distribution-based [Bar94], depth-based [Tuk77], distance-based
(e.g. KNN) [KN98], cluster-based (e.g. DBSCAN) [EKSX96] and density-based
(e.g. LOF) [BKNS00] methods.

However, these methods are often unsuitable in real-world applications due to 
a number of reasons. Firstly, real-world data usually have a scattered distribution, 
where objects are loosely distributed in the domain feature space. That is, from a 
'local' point of view, these objects cannot represent explicit patterns (e.g. clusters) 
to indicate normal data 'behavior'. However, from a 'global' point of view, scattered 
objects constitute several mini-clusters, which represent the pattern of a subset of 
objects. Only the objects which do not belong to any other object groups are genuine 
outliers. Unfortunately, existing outlier definitions depend on the assumption that 
most objects are crowded in a few main clusters. They are incapable of dealing with 
scattered datasets, because mini-clusters in the dataset evoke a high false-detection 
rate (or low precision). 

Secondly, it is difficult in current outlier detection approaches to set accurate
parameters for real-world datasets. Most outlier algorithms must be tuned through
trial-and-error [FZFW06]. This is impractical, because real-world data usually do
not contain labels for anomalous objects. In addition, it is hard to evaluate detection
performance without the confirmation of domain experts. Therefore, the detection
result will be uncontrollable if parameters are not properly chosen.

To alleviate the parameter setting problem, researchers proposed top-n style out- 
lier detection methods. Instead of a binary outlier indicator, top-n outlier methods 
provide a ranked list of objects to represent the degree of 'outlier-ness' for each 
object. The users (domain experts) can re-examine the selected top-n (where n is 
typically far smaller than the cardinality of dataset) anomalous objects to locate 
real outliers. Since this detection procedure can provide a good interaction between 
data mining experts and users, top-n outlier detection methods become popular in 
real-world applications. 

Distance-based, top-n k-th Nearest Neighbour distance [RRS00] is a typical top-n
style outlier detection approach. In order to distinguish it from the original
distance-based outlier detection method in [KN98], we denote the k-th Nearest Neighbour
distance outlier as top-n KNN in this paper. In top-n KNN outlier, the distance from
an object to its k-th nearest neighbour (denoted as k-distance for short) indicates
the outlier-ness of the object. Intuitively, the larger the k-distance is, the higher the
outlier-ness of the object. Top-n KNN outlier regards the n objects with the highest
values of k-distance as outliers [RRS00].
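The top-n KNN ranking described above can be sketched in a few lines. This is only an illustrative brute-force version, not the implementation of [RRS00]: the function name `top_n_knn`, the toy data and the use of plain Euclidean distance are assumptions made for the example.

```python
import math

def top_n_knn(points, k, n):
    """Return the indices of the n objects with the largest k-distance."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other object, in ascending order.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))  # k-distance of object i
    scores.sort(reverse=True)             # largest k-distance first
    return [i for _, i in scores[:n]]

# Hypothetical toy data: four points on a unit square plus one distant point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(top_n_knn(data, k=2, n=1))  # prints [4]
```

The ranking depends only on one distance per object, which is why it reacts strongly to the choice of k.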

A density-based outlier, Local Outlier Factor (LOF) [BKNS00], was proposed in
the same year as top-n KNN. In LOF, an outlier factor is assigned to each object
w.r.t. its surrounding neighbourhood. The outlier factor depends on how closely the
data object is packed in its locally reachable neighbourhood [FZFW06]. Since LOF
uses a threshold to differentiate outliers from normal objects [BKNS00], the same
problem of parameter setting arises. A low outlier-ness threshold will produce a
high false-detection rate, while a high threshold value will result in missing genuine
outliers. In recent real-world applications, researchers have found it more reliable
to use LOF in a top-n manner [TCFC02], i.e. only objects with the highest LOF
values will be considered outliers. Hereafter, we call it top-n LOF.

Besides top-n KNN and top-n LOF, researchers have proposed other methods
to deal with real-world data, such as the connectivity-based (COF) [TCFC02] and
Resolution cluster-based (RB-outlier) [FZFW06]. Although the existing top-n style
outlier detection techniques alleviate the difficulty of parameter setting, the detection
precision of these methods (in this paper, we take top-n KNN and top-n LOF
as typical examples) is low on scattered data. In Section 2 we will discuss further
problems of top-n KNN and top-n LOF.

In this paper we propose a new outlier detection definition, named Local 
Distance-based Outlier Factor (LDOF), which is sensitive to outliers in scattered 
datasets. LDOF uses the relative distance from an object to its neighbours to mea- 
sure how much objects deviate from their scattered neighbourhood. The higher the 
violation degree an object has, the more likely the object is an outlier. In addi- 
tion, we theoretically analyse the properties of LDOF, including its lower bound 
and false-detection probability, and provide guidelines for choosing a suitable neigh- 
bourhood size. In order to simplify parameter setting in real-world applications, the 
top-n technique is employed in our approach. To validate LDOF, we perform vari- 
ous experiments on both synthetic and real-world datasets, and compare our outlier 
detection performance with top-n KNN and top-n LOF. The experimental results 
illustrate that our proposed top-n LDOF represents a significant improvement on 
outlier detection capability for scattered datasets. 

The paper is organised as follows: In Section 2 we illustrate and discuss the
problems of top-n KNN and top-n LOF on a real-world dataset. In Section 3, we
formally introduce the outlier definition of our approach, and mathematically analyse
properties of our outlier-ness factor in Section 4. In Section 5 the top-n LDOF
outlier detection algorithm is described, together with an analysis of its complexity.
Experiments are reported in Section 6, which show the superiority of our method
to previous approaches, at least on the considered datasets. Finally, conclusions are
presented in Section 7.
2 Problem Formulation



In real-world datasets, high dimensionality (e.g. 30 features) and sparse feature
value ranges usually cause objects to be scattered in the feature space. The scattered
data is similar to the distribution of stars in the universe. Locally, they seem to be
randomly allocated in the night sky (i.e. stars observed from the Earth), whereas
globally the stars constitute innumerable galaxies. Figure 1(a) illustrates a 2-D
projection of a real-world dataset, Wisconsin Diagnostic Breast Cancer (WDBC)^1,
which has 30 dimensions. The green points are the benign diagnosis records (regarded
as normal objects), and the red triangles are malignant diagnosis records (i.e. the
outliers we want to capture). Obviously, we cannot detect these outliers in 2-D space,
whereas in high dimension (e.g. 30-D), these scattered normal objects constitute a
certain number of loosely bounded mini-clusters, and we are able to isolate genuine
outliers. Unlike galaxies, which always contain billions of stars, these mini-clusters
in scattered datasets usually have a relatively small number of objects. Figure 1(b)
is a simple demonstration of this situation, where $C_1$ is a well-shaped cluster as
we usually define in other outlier detection methods. $C_2$ and $C_3$ are comprised of
scattered objects with loose boundaries, called mini-clusters. These small clusters
should be recognised as 'normal', even if they contain a small number of objects.
The objects of our interest are the points lying far away from other mini-clusters.
Intuitively, $o_1$, $o_2$, $o_3$, $o_4$ are outliers in this sample. We recall a well accepted
informal outlier definition proposed by Hawkins [Haw80]: "An outlier is an observation
that deviates so much from other observations as to arouse suspicion that it was
generated by a different mechanism". In scattered datasets, an outlier should be an
object deviating from any other group of objects.

The only way in which our outlier definition differs from others (e.g. in [KN98]
and [BKNS00]) is that the normal pattern of data is represented by scattered
objects, rather than crowded main clusters. The neighbourhood in scattered real-world
datasets has two characteristics: (1) objects in mini-clusters are loosely distributed;
(2) when the neighbourhood size k is large, two or more mini-clusters are taken into
consideration. The neighbourhood becomes sparse as more and more objects which
belong to different mini-clusters have to be taken into account.

As discussed above, top-n KNN and top-n LOF are ineffective for scattered
datasets. Take a typical example: in Figure 1(b), when k is greater than the
cardinality of $C_3$ (10 in this case), some objects in $C_1$ become neighbours of the objects in
$C_3$. Hence, for top-n KNN, the k-distance of an object in $C_3$ can be larger than that
of genuine outliers. For top-n LOF, since the density of $C_3$ is smaller than that of $C_1$,
it also fails to rank $o_1$, $o_2$, $o_3$ and $o_4$ in the highest outlier-ness positions. In Section 6
we will demonstrate that the two methods fail to detect genuine outliers when k
grows greater than 10.
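The mini-cluster effect on the k-distance can be reproduced on a toy configuration. The data below are hypothetical (a dense 5x5 grid, a 3-object mini-cluster and one isolated object) and are much smaller than the dataset of Figure 1(b); the helper `k_distance` is an illustrative name, not part of any library.

```python
import math

def k_distance(points, i, k):
    """Distance from object i to its k-th nearest neighbour."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]

# Hypothetical layout: dense main cluster, a 3-object mini-cluster, one isolated object.
main = [(float(x), float(y)) for x in range(5) for y in range(5)]  # indices 0..24
mini = [(20.0, 20.0), (20.0, 21.0), (21.0, 20.0)]                  # indices 25..27
isolated = [(4.0, 8.0)]                                            # index 28
points = main + mini + isolated

for k in (2, 3):
    print(k, k_distance(points, 25, k), k_distance(points, 28, k))
```

With k = 2 the mini-cluster object still finds both neighbours inside its own group, so the isolated object has the larger k-distance. With k = 3 the mini-cluster object must reach outside its group, and its k-distance overtakes that of the isolated object.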

Intuitively, it is more reasonable to measure how an object deviates from its
neighbourhood system as an outlier-ness factor rather than global distance (top-n
KNN) or local density (top-n LOF). Thereby, we propose LDOF to measure the
degree of neighbourhood violation. The formal definition of LDOF is introduced in
the following section.

Figure 1: (a) The 2-D projection of a real-world dataset (WDBC). (b) Simple 2-D
illustration (synthetic 2-D data).

1 WDBC dataset is from UCI ML Repository: http://archive.ics.uci.edu/ml



3 Formal Definition of Local Distance-based Outliers

In this section, we develop a formal definition of the Local Distance-based Outlier 
Factor, which avoids the shortcomings presented above. 

Definition 1 (KNN distance of $x_p$) Let $\mathcal{N}_p$ be the set of the k-nearest neighbours
of object $x_p$ (excluding $x_p$). The k-nearest neighbours distance of $x_p$ equals the
average distance from $x_p$ to all objects in $\mathcal{N}_p$. More formally, let $\mathrm{dist}(x, x') \geq 0$ be
a distance measure between objects $x$ and $x'$. The k-nearest neighbours distance of
object $x_p$ is defined as

$$d_{x_p} := \frac{1}{k} \sum_{x_i \in \mathcal{N}_p} \mathrm{dist}(x_i, x_p).$$

Definition 2 (KNN inner distance of $x_p$) Given the k-nearest neighbours set
$\mathcal{N}_p$ of object $x_p$, the k-nearest neighbours inner distance of $x_p$ is defined as the
average distance among objects in $\mathcal{N}_p$:

$$D_{x_p} := \frac{1}{k(k-1)} \sum_{x_i, x_{i'} \in \mathcal{N}_p,\ i \neq i'} \mathrm{dist}(x_i, x_{i'}).$$

Definition 3 (LDOF of $x_p$) The local distance-based outlier factor of $x_p$ is defined
as:

$$\mathrm{LDOF}_k(x_p) := \frac{d_{x_p}}{D_{x_p}}$$
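Definitions 1 to 3 translate almost directly into code. The following is a minimal brute-force sketch for a single object, assuming Euclidean distance as the default `dist` and assuming the neighbours are not all identical (otherwise the inner distance is zero). The function name `ldof` and the sample points are illustrative only.

```python
import math
from itertools import combinations

def ldof(points, p, k, dist=math.dist):
    """LDOF of points[p]: KNN distance divided by KNN inner distance."""
    others = [q for i, q in enumerate(points) if i != p]
    neighbours = sorted(others, key=lambda q: dist(points[p], q))[:k]
    # Definition 1: average distance from the object to its k neighbours.
    d = sum(dist(points[p], q) for q in neighbours) / k
    # Definition 2: average over ordered pairs; each unordered pair counts twice.
    pair_sum = sum(dist(a, b) for a, b in combinations(neighbours, 2))
    inner = 2.0 * pair_sum / (k * (k - 1))
    # Definition 3: ratio of the two averages.
    return d / inner

# Hypothetical data: unit square, its centre, and one distant object.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (5.0, 5.0)]
print(ldof(data, 4, k=4))  # centre of the square: below 1
print(ldof(data, 5, k=4))  # distant object: well above 1
```

The centre point is surrounded by its neighbours and scores below 1, while the distant object scores far above 1.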
Figure 2: (a) An anomalous object $x_p$ with scattered neighbours. (b) The explicit
outlier-ness of object $x_p$ with the help of the LDOF definition. A is the center of the
neighbourhood system of $x_p$. The dashed circle includes all neighbours of $x_p$. The solid
circle is $x_p$'s "reformed" neighbourhood region.



If we regard the k-nearest neighbours as a neighbourhood system, LDOF
captures the degree to which object $x_p$ deviates from its neighbourhood system. It has
the clear intuitive meaning that LDOF is the distance ratio indicating how far the
object $x_p$ lies outside its neighbourhood system. When LDOF < 1, it means that
$x_p$ is surrounded by a data 'cloud'. On the contrary, when LDOF $\gg$ 1, $x_p$ is outside
the whole 'cloud'. It is easy to see that the higher LDOF is, the farther $x_p$ is away
from its neighbourhood system.

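To further explain our definition, we exemplify it in Euclidean space. Hereinafter,
let $x_i \in \mathcal{X} = \mathbb{R}^d$, and $\bar{x} := \frac{1}{k} \sum_{x_i \in \mathcal{N}_p} x_i$. For the squared Euclidean distance $\|\cdot\|^2$,
the outlier definition can be written as:

$$d_{x_p} = \frac{1}{k} \sum_{x_i \in \mathcal{N}_p} \|x_i - x_p\|^2 = \frac{1}{k} \sum_{x_i \in \mathcal{N}_p} \|x_i - \bar{x}\|^2 + \|x_p - \bar{x}\|^2 \qquad (1)$$

$$D_{x_p} = \frac{1}{k(k-1)} \sum_{x_i, x_{i'} \in \mathcal{N}_p} \|x_i - x_{i'}\|^2 = \frac{2}{k-1} \sum_{x_i \in \mathcal{N}_p} \|x_i - \bar{x}\|^2 \qquad (2)$$

Thus, $\mathrm{LDOF}_k(x_p) \gg 1$, i.e. $x_p$ lies outside its neighbourhood system, iff

$$\|x_p - \bar{x}\|^2 \gg \frac{k+1}{k(k-1)} \sum_{x_i \in \mathcal{N}_p} \|x_i - \bar{x}\|^2 \qquad (3)$$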



The same expression holds for the more general Mahalanobis distance [MKB79].
In Equation (3), the left-hand side is the squared distance of $x_p$ to its neighbourhood
centroid $\bar{x}$, and the right-hand side becomes the distance variance in $\mathcal{N}_p$ when $k \gg 1$.
Therefore, Equation (3) can be understood as follows: The k-nearest neighbours of
object $x_p$ form a "reformed" neighbourhood region, represented as a hyperball with
radius $D_{x_p}$, centered at $\bar{x}$. As illustrated in Figure 2(a), since the neighbours of $x_p$
are scattered, it is unclear whether $x_p$ (indicated by A) belongs to its neighbourhood
system or not. Our LDOF definition, as shown in Figure 2(b), clearly regards $x_p$
as lying outside its reformed neighbourhood region. The LDOF of $x_p$ is obviously
greater than 1, which indicates that $x_p$ is an outlier. Through this example we
can see that LDOF can effectively capture the outlier-ness of an object among a
scattered neighbourhood. In addition, as k grows, LDOF takes more objects into
consideration, and the view of LDOF becomes increasingly global. If an object is far
from its large neighbourhood system (in the extreme, the whole dataset) it is definitely a
genuine outlier. Hence, the detection precision of our method might be stable over
a large range of k. In the following section, we will theoretically analyse properties
of LDOF, and propose a heuristic for selecting the neighbourhood size k.
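The squared-Euclidean identities used in this section can be checked numerically. With $S := \sum_i \|x_i - \bar{x}\|^2$, they read $d_{x_p} = S/k + \|x_p - \bar{x}\|^2$ and $D_{x_p} = 2S/(k-1)$, so that $d_{x_p} > D_{x_p}$ exactly when $\|x_p - \bar{x}\|^2 > \frac{k+1}{k(k-1)} S$. The sketch below uses an arbitrary random point set as the neighbourhood (an assumption made for illustration; the identities hold for any k points, not only true nearest neighbours).

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

random.seed(0)
k, dim = 8, 3
xp = [random.gauss(0, 1) for _ in range(dim)]
nbrs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

centroid = [sum(col) / k for col in zip(*nbrs)]
scatter = sum(sq_dist(x, centroid) for x in nbrs)  # S in the text above

# KNN distance: direct definition versus centroid form.
d_direct = sum(sq_dist(x, xp) for x in nbrs) / k
d_centroid = scatter / k + sq_dist(xp, centroid)

# KNN inner distance: the a == b terms are zero, so summing all ordered pairs is fine.
D_direct = sum(sq_dist(a, b) for a in nbrs for b in nbrs) / (k * (k - 1))
D_centroid = 2.0 * scatter / (k - 1)

print(abs(d_direct - d_centroid), abs(D_direct - D_centroid))
```

Both printed differences should be at the level of floating-point rounding error.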



4 Properties of LDOF 



Lower bound of LDOF. Ideally, we prefer a universal threshold of LDOF to
unambiguously distinguish abnormal from normal objects (e.g. in any dataset, an
object is an outlier if LDOF > 1). However, the threshold is problem dependent due to
the complex structure of real-world datasets. Under some continuity assumption, we
can calculate an asymptotic lower bound on LDOF, denoted as $\mathrm{LDOF}_{lb}$. $\mathrm{LDOF}_{lb}$
indicates that an object is an inlier (or normal) if its LDOF is smaller than $\mathrm{LDOF}_{lb}$.

Theorem 4 (LDOF lower-bound of outliers) Let data $\mathcal{D}$ be sampled from a
density that is continuous at $x_p$. For $N \gg k \gg 1$ we have $\mathrm{LDOF}_{lb} \approx \frac{1}{2}$ with
high probability. More formally, for $k, N \to \infty$ such that the neighbourhood size
$D_{x_p} \to 0$ we have

$$\mathrm{LDOF}_{lb} = \frac{d_{x_p}}{D_{x_p}} \to \frac{1}{2} \quad \text{with probability } 1.$$

The theorem shows that when LDOF $\approx \frac{1}{2}$, the point is squarely lying in a uniform
cloud of objects, i.e. it is not an outlier. The lower bound of LDOF provides a
potential pruning rule to reduce algorithm complexity. In practice, objects can be directly
ignored if their LDOFs are smaller than $\frac{1}{2}$. Remarkably, $\mathrm{LDOF}_{lb}$ does not depend
on the dimension of $\mathcal{X}$. This is very convenient: data often lie on lower-dimensional
manifolds. Since locally, a manifold is close to a Euclidean space (of lower
dimension), the result still holds in this case. Therefore, we do not need to know the
effective dimension of our data.

Proof sketch. Consider data sampled from a continuous density (e.g. Gaussian or
other standard distributions). For fixed k, as sample size N goes to infinity, the size
of the k-nearest neighbours region tends to zero. Locally any continuous distribution
is approximately uniform. In the following we assume a uniform density around $x_p$.
The achieved result then generalizes to arbitrary distributions continuous at $x_p$ by
taking the limit $N \to \infty$.

Without loss of generality, let $x_p = 0$. Fix some sufficiently small radius $r > 0$ and
let $B_r$ be the ball of radius r around 0. By assumption, data $\mathcal{D}$ is locally uniformly
distributed, which induces a uniform distribution in $B_r$, i.e. all $x_i \in \mathcal{N}_p$ are uniformly
distributed random variables in $B_r$. Hence their expected value $\mathbb{E}[x_i] = 0$. This
implies

$$\mathbb{E}[D_{x_p}] = \frac{1}{k(k-1)} \sum_{i \neq i'} \mathbb{E}\big[\|x_i\|^2 - 2\, x_i \cdot x_{i'} + \|x_{i'}\|^2\big] = \frac{2}{k} \sum_{i} \mathbb{E}[\|x_i\|^2] = 2\,\mathbb{E}[d_{x_p}].$$

In the first equality we simply expanded the square in the definition of $D_{x_p}$, where $\cdot$
is the scalar product. In the second equality we used $\mathbb{E}[x_i \cdot x_{i'}] = \mathbb{E}[x_i] \cdot \mathbb{E}[x_{i'}] = 0$
for $i \neq i'$. The last equality is just the definition of $d_{x_p}$ for $x_p = 0$. Taking the ratio
we get

$$\mathbb{E}[d_{x_p}] / \mathbb{E}[D_{x_p}] = 1/2.$$

Note that the only property of the sampling distribution we used was $\mathbb{E}[x_i] = 0$,
i.e. the result holds for more general distributions (e.g. any symmetric distribution
around $x_p = 0$).

Using the central limit theorem or explicit calculation, one can show that for large
k and N, the distributions of $d_{x_p}$ and $D_{x_p}$ concentrate around their means $\mathbb{E}[d_{x_p}]$
and $\mathbb{E}[D_{x_p}]$, respectively, which implies that $d_{x_p} / D_{x_p} \approx 1/2$ with high probability.

This also shows that for any sampling density continuous at $x_p$ (since they are
locally approximately uniform), $d_{x_p} / D_{x_p} \to \frac{1}{2}$ holds, provided $D_{x_p} \to 0$. We skip
the formal proof. ■
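The limiting ratio of 1/2 can be illustrated with a small simulation. The sketch below follows the setting of the proof sketch: the object sits at the origin, its k neighbours are drawn uniformly from the unit ball, and squared Euclidean distance is used. The sample sizes, the seed and the helper names are arbitrary choices for illustration.

```python
import random

def sample_ball(dim, rng):
    """Draw one point uniformly from the unit ball by rejection sampling."""
    while True:
        p = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if sum(x * x for x in p) <= 1.0:
            return p

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ratio_at_origin(k, dim, rng):
    """d / D for an object at the origin with k uniformly scattered neighbours."""
    pts = [sample_ball(dim, rng) for _ in range(k)]
    origin = [0.0] * dim
    d = sum(sq_dist(p, origin) for p in pts) / k
    D = sum(sq_dist(a, b) for a in pts for b in pts) / (k * (k - 1))
    return d / D

rng = random.Random(0)
ratios = [ratio_at_origin(k=200, dim=3, rng=rng) for _ in range(5)]
print(ratios)
```

For k = 200 the printed ratios should all lie close to 0.5, in line with Theorem 4.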



False-detection probability. As discussed in Section 1, in real-world datasets,
it is hard to set parameters properly by trial-and-error. Instead of requiring prior
knowledge from datasets (e.g. outlier labels), we theoretically determine the
false-detection probability, given neighbourhood size k.

Theorem 5 (False-detection probability of LDOF) Let data $\mathcal{D}$ be uniformly
distributed in a neighbourhood of $x_p$ containing k objects $\mathcal{N}_p$. For LDOF threshold
$c > \frac{1}{2}$, the probability of falsely detecting $x_p \in \mathbb{R}^d$ as an outlier is exponentially small
in k. More precisely,

$$P[\mathrm{LDOF}_k(x_p) > c] < e^{-\alpha(k-2)}, \quad \text{where } \alpha := \frac{2}{25}\Big(1 - \frac{1}{2c}\Big)^2 \Big(\frac{d}{d+2}\Big)^2.$$

The bound still holds for non-uniform densities continuous in $x_p$, provided $N \gg k$.
In particular, for c = 1 in high-dimensional spaces ($d \to \infty$) we get $\alpha \to \frac{1}{50}$. So for
$k \gg 50$ the false-detection probability is very small. Note that because the bound
is quite crude, we can expect good performance in practice for much smaller k. On
the other hand, choosing $c \approx \frac{1}{2}$ degenerates the bound (i.e. $\alpha \to 0$), consistent with
Theorem 4.
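To get a feeling for how crude the bound of Theorem 5 is, it can simply be evaluated. The sketch below assumes the constant reads $\alpha = \frac{2}{25}(1 - \frac{1}{2c})^2 (\frac{d}{d+2})^2$, which is the form that follows from the proof sketch and gives $\alpha \to 1/50$ for c = 1 and large d. The function names are illustrative.

```python
import math

def ldof_alpha(c, dim):
    """Exponent constant of the false-detection bound (assumed form, see lead-in)."""
    if c <= 0.5:
        raise ValueError("the bound requires a threshold c > 1/2")
    return (2.0 / 25.0) * (1.0 - 1.0 / (2.0 * c)) ** 2 * (dim / (dim + 2.0)) ** 2

def false_detection_bound(c, k, dim):
    """Upper bound exp(-alpha * (k - 2)) on P[LDOF_k(x_p) > c]."""
    return math.exp(-ldof_alpha(c, dim) * (k - 2))

for k in (10, 50, 200, 1000):
    print(k, false_detection_bound(c=1.0, k=k, dim=30))
```

The bound only becomes small for k well above 50, which matches the remark that it is quite crude and that much smaller k can work in practice.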

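Proof sketch. We follow the notation used in the proof of Theorem 4. We consider
a uniform data distribution first. For $x_p = 0$ and dropping p, we can write the
distances as

$$d = \frac{1}{k} \sum_{j=1}^{k} \|x_j\|^2 =: \overline{x^2}, \qquad D = \frac{1}{k(k-1)} \sum_{j \neq j'} \|x_j - x_{j'}\|^2 = \frac{2k}{k-1} \big( \overline{x^2} - \bar{x}^2 \big),$$

where $\bar{x} := \frac{1}{k} \sum_{j=1}^{k} x_j$ and $\bar{x}^2 := \|\bar{x}\|^2$.

For $x_j$ uniformly distributed in the ball $B_r := \{x : \|x\| \leq r\}$, one can compute the mean
square length explicitly:

$$a := \mathbb{E}[\|x_j\|^2] = \frac{1}{\mathrm{Volume}(B_r)} \int_{B_r} \|x\|^2 \, dx = \frac{\int_0^r \rho^{d+1} \, d\rho}{\int_0^r \rho^{d-1} \, d\rho} = \frac{d}{d+2}\, r^2$$

where d is the dimensionality of $x = x_j \in \mathcal{X} = \mathbb{R}^d$. The first equality is just the
definition of a uniform expectation over $B_r$. The second equality exploits rotational
symmetry and reduces the d-dimensional integral to a one-dimensional radial
integral. The last equality is elementary. The expected values of $\overline{x^2}$ and $\bar{x}^2$, respectively,
are

$$\mathbb{E}[\overline{x^2}] = \frac{1}{k} \sum_{j=1}^{k} \mathbb{E}[\|x_j\|^2] = a \qquad (6)$$

$$\mathbb{E}[\bar{x}^2] = \frac{1}{k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} \mathbb{E}[x_j \cdot x_{j'}] = \frac{1}{k^2} \sum_{j=1}^{k} \mathbb{E}[x_j^2] = \frac{1}{k}\, a \qquad (7)$$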



where we have exploited $\mathbb{E}[x_j \cdot x_{j'}] = \mathbb{E}[x_j] \cdot \mathbb{E}[x_{j'}] = 0$ for $j \neq j'$. By rearranging
terms, we see that

$$d > cD \iff \bar{x}^2 > \gamma\, \overline{x^2}, \quad \text{where } \frac{1}{k} < \gamma := 1 - \frac{k-1}{2kc} < 1 \quad (c > \tfrac{1}{2}).$$

Thus we need (bounds on) the probabilities that $\bar{x}^2$ and $\overline{x^2}$ deviate (significantly)
from their expectation. For any (vector-valued) i.i.d. random variables $x_1, \ldots, x_k$
and any function $f(x_1, \ldots, x_k)$ symmetric under permutation of its arguments,
McDiarmid's inequality can be written as follows:

Let $\Delta \geq \Delta' := \sup_{x_2, \ldots, x_k} \{ \sup_{x_1} f(x_1, \ldots, x_k) - \inf_{x_1} f(x_1, \ldots, x_k) \}$, then

$$P\big[ f(x_1, \ldots, x_k) - \mathbb{E}[f(x_1, \ldots, x_k)] \geq t \big] \leq \exp\{-2t^2 / (k\Delta^2)\} \quad \forall t \geq 0$$

For $f_1 := \bar{x}^2$ an elementary calculation using $x_j \in B_r$ gives $\Delta'_1 = 4(k-1)r^2/k^2$.
For $f_2 := \overline{x^2}$ we get $\Delta'_2 = r^2/k$ straightforwardly. Now consider the real quantity
of interest: $f(x_1, \ldots, x_k) := \bar{x}^2 - \gamma\, \overline{x^2}$. Combining the ranges, we can bound
$\Delta' \leq \Delta'_1 + \gamma \Delta'_2 \leq 5r^2/k =: \Delta$. The expectation of f is $\mathbb{E}[\bar{x}^2 - \gamma\, \overline{x^2}] = \frac{1}{k} a - \gamma a$. Let
$t := a(\gamma - \frac{1}{k}) > 0$. Then using McDiarmid's inequality we get

$$P[d > cD] = P[\bar{x}^2 > \gamma\, \overline{x^2}] = P\big[ (\bar{x}^2 - \gamma\, \overline{x^2}) - \mathbb{E}[\bar{x}^2 - \gamma\, \overline{x^2}] > t \big] \leq \exp\{-2t^2 / (k\Delta^2)\} \leq \exp\{-\alpha(k-2)\}$$

The last inequality follows from

$$\frac{2t^2}{k\Delta^2} = \frac{2a^2 k}{25 r^4} \Big( \gamma - \frac{1}{k} \Big)^2 = \frac{2k}{25} \Big( \frac{d}{d+2} \Big)^2 \Big( 1 - \frac{1}{2c} \Big)^2 \Big( 1 - \frac{1}{k} \Big)^2 \geq \alpha(k-2),$$

where we have inserted $\Delta$, $a$, and $\gamma$, and used $k(1 - \frac{1}{k})^2 \geq k - 2$ and $\alpha$ from the
theorem. This proves the theorem for the uniform distribution.

An argument analogous to that in the proof of Theorem 4 shows that the result still
holds for non-uniform distributions if $N \to \infty$, since a continuous density is locally
approximately uniform. ■




5 LDOF Outlier Detection Algorithm and Its Complexity

Top-n LDOF. Even with the theoretical analysis of the previous section, it is still
hard to determine a threshold for LDOF to identify outliers in an arbitrary dataset.
Therefore we employ top-n style outlier detection, which ranks the n objects with
the highest LDOFs. The algorithm that obtains the top-n LDOF outliers for all
the N objects in a given dataset $\mathcal{D}$ is outlined in Algorithm 1.

How to choose k. Based on Theorem 5, it is beneficial to use a large neighbourhood
size k. However, too large a k will lead to a global method with the same problems as
top-n KNN outlier. For the best use of our algorithm, a lower bound on potentially
suitable k is given as follows: If the effective dimension of the manifold on which
$\mathcal{D}$ lies is m, then at least m points are needed to 'surround' another object. That
is to say, $k \geq m$ is needed. In Section 6 we will see that, when k increases to
the dimension of the dataset, the detection performance of our method rises, and
remains stable for a wide range of k values. Therefore, the parameter k in LDOF is
easier to choose than in other outlier detection approaches.

Algorithm complexity. In Step 1, querying the k-nearest neighbours takes the
majority of the computational load. Naively, the runtime of this step is $O(N^2)$. If
a tree-based spatial index such as X-tree or R*-tree is used [BKNS00, BKNS99],
the complexity is reduced to $O(N \log N)$. Step 2 is straightforward and calculates
LDOF values according to Definition 3. As the k-nn query is materialised, this step
is linear in N. Step 3 sorts the N objects according to their LDOF values, which
can be done in $O(N \log N)$. Since the objects with LDOF < $\mathrm{LDOF}_{lb}$ are discarded
(i.e. they are definitely non-outliers), the number of objects to be sorted in this
step is smaller than N in practice. Finally, the overall computation complexity of
Algorithm 1 is $O(N \log N)$ with appropriate index support.

Algorithm 1 Top-n LDOF (Top-n Local Distance-based Outlier Factor)
Input: A given dataset $\mathcal{D}$, natural numbers n and k.

1. For each object p in $\mathcal{D}$, retrieve p's k-nearest neighbours;

2. Calculate the LDOF for each object p.
   The objects with LDOF < $\mathrm{LDOF}_{lb}$ are directly discarded;

3. Sort the objects according to their LDOF values;

4. Output: the first n objects with the highest LDOF values.
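The four steps of Algorithm 1 can be sketched as follows. This is a naive $O(N^2)$ illustration without any spatial index, assuming Euclidean distance and distinct objects. The function name `top_n_ldof`, the `lower_bound` parameter (defaulting to the asymptotic value 1/2 from Theorem 4, or `None` to disable pruning) and the toy data are assumptions made for the example.

```python
import math
from itertools import combinations

def top_n_ldof(points, k, n, lower_bound=0.5):
    """Return the n (index, LDOF) pairs with the highest LDOF values."""
    scored = []
    for i, p in enumerate(points):
        # Step 1: k-nearest neighbours by brute force (no index support).
        others = (q for j, q in enumerate(points) if j != i)
        nbrs = sorted(others, key=lambda q: math.dist(p, q))[:k]
        # Step 2: LDOF = KNN distance / KNN inner distance.
        d = sum(math.dist(p, q) for q in nbrs) / k
        inner = 2.0 * sum(math.dist(a, b) for a, b in combinations(nbrs, 2)) / (k * (k - 1))
        score = d / inner  # assumes the neighbours are not all identical
        # Pruning: discard objects below the lower bound.
        if lower_bound is not None and score < lower_bound:
            continue
        scored.append((score, i))
    # Steps 3 and 4: sort by LDOF and output the first n objects.
    scored.sort(reverse=True)
    return [(i, s) for s, i in scored[:n]]

# Hypothetical data: a 3x3 cluster, a 2x2 mini-cluster and one isolated object.
cluster_a = [(float(x), float(y)) for x in range(3) for y in range(3)]  # indices 0..8
cluster_b = [(10.0 + x, 10.0 + y) for x in range(2) for y in range(2)]  # indices 9..12
points = cluster_a + cluster_b + [(20.0, 0.0)]                          # index 13
print(top_n_ldof(points, k=3, n=1))
```

On this toy data the isolated object at index 13 is ranked first, while members of the small 2x2 mini-cluster obtain LDOF values of about 1.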

6 Experiments 

In this section, we compare the outlier detection performance of top-n LDOF with 
two typical top-n outlier detection methods, top-n KNN and top-n LOF. Experi- 
ments start with a synthetic 2-D dataset which contains outliers that are meaningful 
but are difficult for top-n KNN and top-n LOF. In Experiments 2 and 3, we identify 
outliers in two real-world datasets to illustrate the effectiveness of our method in 
real-world situations. For consistency, we only use the parameter k to represent the 
neighbourhood size in the investigation of the three methods. In particular, in top-n 
LOF, the parameter MinPts is set to neighbourhood size k as chosen in the other 
two methods. 

Synthetic Data. In Figure 1(b), there are 150 objects in cluster $C_1$, 50 objects in
cluster $C_2$, 10 objects in cluster $C_3$, and 4 additional objects {$o_1$, $o_2$, $o_3$, $o_4$} which
are genuine outliers. We ran the three outlier detection methods over a large range
of k. We use detection precision^2 to evaluate the performance of each method. In
this experiment, we set n = 4 (the number of real outliers). The experimental result
is shown in Figure 3(a). The precision of top-n KNN becomes 0 when k is larger
than 10 due to the effect of the mini-cluster $C_3$ as we discussed in Section 2. For
the same reason, the precision of top-n LOF drops dramatically when k is larger
than 11. When k reaches 13, top-n LOF misses all genuine outliers in the top-4
ranking (they even drop out of the top-10). In contrast, our method does not suffer
from the effect of the mini-cluster. As shown in Figure 3(a), the precision of our
approach stays stable at 100% over a large neighbourhood size range (i.e.
20-50).

Figure 3: Detection precision of top-n LDOF, top-n KNN and top-n LOF on (a)
the synthetic dataset, (b) the WDBC dataset.

2 $\mathrm{Precision} = n_{\text{real-outliers in top-}n} / n$. We set n as the number of real outliers if possible.
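The detection precision used in the experiments is the fraction of the top-n ranked objects that are real outliers. A minimal sketch, assuming the ranking is given as a list of object identifiers ordered from most to least outlying (the function name and the sample identifiers are hypothetical):

```python
def detection_precision(ranked_ids, true_outliers, n):
    """Fraction of the top-n ranked objects that are real outliers."""
    top = ranked_ids[:n]
    hits = sum(1 for obj in top if obj in true_outliers)
    return hits / n

# Hypothetical ranking: two of the four top-ranked objects are real outliers.
ranking = [7, 3, 9, 1, 4, 8]
real = {3, 9, 4, 8}
print(detection_precision(ranking, real, n=4))  # prints 0.5
```

When n equals the number of real outliers, as in the synthetic experiment, this value coincides with recall.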

Medical Diagnosis Data. In real-world data repositories, it is hard to find
a dataset for evaluating outlier detection algorithms, because only for very few
real-world datasets is it exactly known which objects are really behaving differently
[KSZ08]. In this experiment, we use a medical dataset, WDBC (Diagnosis)^1,
which has been used for nuclear feature extraction for breast tumor diagnosis. The
dataset contains 569 medical diagnosis records (objects), each with 32 attributes
(ID, diagnosis, 30 real-valued input features). The diagnosis is binary: 'Benign' and
'Malignant'. We regard the objects labeled 'Benign' as normal data. In the experiment
we use all 357 'Benign' diagnosis records as normal objects and add a certain
number of 'Malignant' diagnosis records into the normal objects as outliers. Figure 3(b)
shows the experimental result for adding the first 10 'Malignant' records from the
original dataset. Based on the rule for selecting the neighbourhood size k suggested
in Section 5, we set $k \geq 30$ with regard to the data dimension. We measure the
percentage of real outliers detected in the top-10 potential outliers as detection
precision^2. In the experiments, we progressively increase the value of k and calculate the
detection precision for each method. As shown in Figure 3(b), the precision of our
method begins to ascend at k = 32, and stays stable when k is greater than 34 with
a detection accuracy of 80%. In comparison, the precision of the other two techniques
stays low over the whole k value range.

To further validate our approach, we repeat the experiment 5 times with a different
number of outliers (randomly extracted from 'Malignant' objects). Each time,
we perform 30 independent runs, and calculate the average detection precision and
standard deviation over the k range from 30 to 50. The experimental results are
listed in Table 1. The bold numbers indicate that the detection precision vector
over the range of k is statistically significantly improved compared to the other two
methods (paired T-test at the 0.1 level).

Table 1: The detection precision for each method based on 30 independent runs.

Number of outliers    Precision (mean ± std.)
                      LDOF          LOF           KNN
1                     0.29±0.077    0.12±0.061    0.05±0.042
2                     0.33±0.040    0.13±0.028    0.11±0.037
3                     0.31±0.033    0.22±0.051    0.22±0.040
4                     0.35±0.022    0.27±0.040    0.26±0.035
5                     0.38±0.026    0.28±0.032    0.28±0.027

Figure 4 & Table 2: Outlier detection precision over different neighbourhood sizes
for the Shuttle dataset based on 15 independent runs. (Figure 4 plots precision
against neighbourhood size k.)

Precision (mean ± std.)
LDOF          LOF           KNN
0.25±0.081    0.03±0.057    0.08±0.114



Space Shuttle Data. In this experiment, we use a dataset originally used for
classification, named Shuttle^3. We use the testing dataset which contains 14500
objects, and each object has 9 real-valued features and an integer label (1-7). We
regard the (only 13) objects with label 2 as outliers, and regard the remaining six
classes as normal data. We run the experiment 15 times and each time we randomly
pick a sample of normal objects (i.e. 1,000 objects) to mix with the 13 outliers. The
mean values of detection precision of the three methods are presented in Figure 4.
As illustrated in Figure 4, top-n KNN has the worst performance (it rapidly drops to
0). Top-n LOF is better: it has a narrow precision peak (k from 5 to 15), and
then declines dramatically. Top-n LDOF has the best performance, as it ascends
steadily and keeps a relatively high precision over the k range from 25 to 45. Table 2
shows the average precisions for the three methods over 15 runs. The bold numbers
indicate that the precision vector is statistically significantly improved compared to
the other two methods (paired T-test at the 0.1 level).



3 The Shuttle dataset can also be downloaded from UCI ML Repository. 
7 Conclusion



In this paper, we have proposed a new outlier detection definition, LDOF. Our def- 
inition uses a local distance-based outlier factor to measure the degree to which an 
object deviates from its scattered neighbourhood. We have analysed the properties 
of LDOF, including its lower bound and false-detection probability. Furthermore, a 
method for selecting k has been suggested. In order to ease the parameter setting in 
real-world applications, the top-n technique has been used in this approach. Experi- 
mental results have demonstrated the ability of our new approach to better discover 
outliers with high precision, and to remain stable over a large range of neighbour- 
hood sizes, compared to top-n KNN and top-n LOF. As future work, we are looking 
to extend the proposed approach to further enhance the outlier detection accuracy 
for scattered real-world datasets. 